Semi-automated annotation of page-based documents within the Genre and Multimodality framework

Author

  • Tuomo Hiippala
Abstract

This paper describes ongoing work on a tool developed for annotating document images for their multimodal features and compiling this information into a corpus. The tool leverages open source computer vision and natural language processing libraries to describe the content and structure of multimodal documents and to generate multiple layers of XML annotation. The paper introduces the annotation schema, describes the document processing pipeline and concludes with a brief description of future work.
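As a rough illustration of what generating one layer of XML annotation might look like, the sketch below turns a list of detected page regions into a simple XML tree. It is not the tool's actual schema or pipeline: the element and attribute names (`document`, `unit`, `kind`, `bbox`), the `build_base_layer` helper and the sample regions are all assumptions made for this example, and the regions are supplied by hand rather than by a computer vision or OCR library.

```python
import xml.etree.ElementTree as ET


def build_base_layer(regions):
    """Build a hypothetical annotation layer from detected page regions.

    Each region is a dict with a 'kind' ('text' or 'image'), a 'bbox'
    given as (x, y, width, height) in pixels and, for text regions,
    the recognised 'text'. All names here are illustrative assumptions.
    """
    root = ET.Element("document")
    for index, region in enumerate(regions, start=1):
        # One <unit> per region, with a running identifier.
        unit = ET.SubElement(root, "unit", id=f"u-{index}", kind=region["kind"])
        x, y, w, h = region["bbox"]
        unit.set("bbox", f"{x},{y},{w},{h}")
        # Only text regions carry transcribed content.
        if region["kind"] == "text":
            unit.text = region.get("text", "")
    return root


# Hand-written sample input standing in for the output of the
# computer vision and text recognition steps.
regions = [
    {"kind": "text", "bbox": (10, 20, 300, 40), "text": "Helsinki"},
    {"kind": "image", "bbox": (10, 80, 300, 200)},
]

base = build_base_layer(regions)
xml_string = ET.tostring(base, encoding="unicode")
print(xml_string)
```

Further layers could, under the same assumptions, refer back to these units by their `id` values instead of repeating the content.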

Similar resources

The Impact of Noise in Web Genre Identification

Genre detection of web documents fits an open-set classification task. Web documents that do not belong to any predefined genre, or in which multiple genres co-exist, are considered noise. In this work we study the impact of noise on automated genre identification within an open-set classification framework. We examine alternative classification models and document representation schemes based on t...

Domain Specific Language in Technical Solution Documents - Discussion of Two Approaches to Improve the Semi-automated Annotation

The efficient search for existing solutions in mechanical engineering is a key-factor for successful product development. Ontology-based knowledge systems can support the semi-automated annotation of documents about existing solutions and enable the retrieval of those documents. However, the use of different wordings for similar products and a generally heterogeneous domain-specific language hi...

Ontea: Platform for Pattern Based Automated Semantic Annotation

Automated annotation of web documents is a key challenge of the Semantic Web effort. Semantic metadata can be created manually or using automated annotation or tagging tools. The automated semantic annotation tools with the best results are built on various machine learning algorithms which require training sets. Another approach is to use pattern based semantic annotation solutions built on natural lang...

Towards Automatic Web Genre Identification: A Corpus-Based Approach in the Domain of Academia by Example of the Academic's Personal Homepage

We argue for a systematic analysis of one particular, well structured domain—academic Web pages—with regard to a special class of digital genres: Web genres. For this purpose, we have developed a database-driven system that will ultimately consist of more than 3 000 000 HTML documents, written in German, which are the empirical basis for our research. We introduce the notions of Web genre type ...

IJEL 4/1 page layout

Instructional text, and procedural text in particular, is a genre that users heavily rely upon when they are learning new procedures, devices or systems. It is, however, also well-known to be a genre that is difficult to produce and maintain. This article discusses Isolde, an environment that attempts to address this problem by supporting the semi-automated production of procedural instructions...

Publication date: 2016